
Gett, previously known as GetTaxi, is a technology platform focused solely on corporate Ground Transportation Management (GTM). Clients order taxis through the app, and drivers accept the resulting ride offers. When a client taps the Order button, the matching system searches for the most relevant drivers and offers them the order. In this task, we investigate matching metrics for orders that were not completed successfully, i.e., the customer did not end up getting a car.

This data was taken from the StrataScratch platform.

Data Description

We have two data sets, data_orders and data_offers, both stored in CSV format. The data_orders data set contains the following columns:

- order_datetime - time of the order
- origin_longitude - longitude of the order
- origin_latitude - latitude of the order
- m_order_eta - time before order arrival
- order_gk - order number
- order_status_key - status, an enumeration with the following mapping: 4 - cancelled by client; 9 - cancelled by system, i.e., a reject
- is_driver_assigned_key - whether a driver has been assigned
- cancellations_time_in_seconds - how many seconds passed before cancellation

The data_offers data set is a simple map with 2 columns:

- order_gk - order number, matching the same column in the orders data set
- offer_id - ID of an offer
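Before loading the real files, the relationship between the two tables can be sketched on toy data (the order numbers and statuses below are made up, not from the data sets): a left merge of orders onto offers with `indicator=True` flags orders that never received a single offer.

```python
import pandas as pd

# Toy stand-ins for data_orders / data_offers (values are hypothetical)
orders = pd.DataFrame({"order_gk": [1, 2, 3], "order_status_key": [4, 9, 4]})
offers = pd.DataFrame({"order_gk": [1, 1, 3], "offer_id": [10, 11, 12]})

# how='left' keeps every order; _merge == 'left_only' marks orders
# that appear in data_orders but have no rows in data_offers
joined = orders.merge(offers, on="order_gk", how="left", indicator=True)
no_offer = joined.loc[joined["_merge"] == "left_only", "order_gk"].unique()
print(no_offer)  # order 2 received no offers
```

The same `order_gk` can appear multiple times in data_offers (one row per offer), as the lookup for order 3000583116437 below demonstrates on the real data.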

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import plotly
import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
import h3
pio.renderers.default='notebook'

data_orders = pd.read_csv('data_orders.csv')

data_orders.head()
Out[ ]:
order_datetime origin_longitude origin_latitude m_order_eta order_gk order_status_key is_driver_assigned_key cancellations_time_in_seconds
0 18:08:07 -0.978916 51.456173 60.0 3000583041974 4 1 198.0
1 20:57:32 -0.950385 51.456843 NaN 3000583116437 4 0 128.0
2 12:07:50 -0.969520 51.455544 477.0 3000582891479 4 1 46.0
3 13:50:20 -1.054671 51.460544 658.0 3000582941169 4 1 62.0
4 21:24:45 -0.967605 51.458236 NaN 3000583140877 9 0 NaN
In [ ]:
data_offers = pd.read_csv('data_offers.csv')

data_offers.head()
Out[ ]:
order_gk offer_id
0 3000579625629 300050936206
1 3000627306450 300052064651
2 3000632920686 300052408812
3 3000632771725 300052393030
4 3000583467642 300051001196
In [ ]:
all_offers = data_offers[data_offers['order_gk']==3000583116437]
all_offers
Out[ ]:
order_gk offer_id
3302 3000583116437 300050986179
142592 3000583116437 300050986174
315434 3000583116437 300050986180

Info

In [ ]:
data_offers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 334363 entries, 0 to 334362
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   order_gk  334363 non-null  int64
 1   offer_id  334363 non-null  int64
dtypes: int64(2)
memory usage: 5.1 MB

Cleaning Data: Empty Cells

In [ ]:
data_offers[data_offers.isnull().any(axis=1)].count()
Out[ ]:
order_gk    0
offer_id    0
dtype: int64
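data_offers contains no empty cells. data_orders, by contrast, does contain NaNs: as the head() output above shows, m_order_eta and cancellations_time_in_seconds are empty for some rows (e.g., system rejects with no assigned driver). A quick way to count NaNs per column, sketched on a toy frame with made-up values:

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the NaN pattern in data_orders (values are hypothetical)
df = pd.DataFrame({
    "order_gk": [1, 2, 3],
    "m_order_eta": [60.0, np.nan, 477.0],
    "cancellations_time_in_seconds": [198.0, 128.0, np.nan],
})
print(df.isnull().sum())
```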

Build the distribution of orders by failure reason: cancellations before and after driver assignment, and system rejections. Analyse the resulting plot. Which category has the highest number of orders?

In [ ]:
# data_orders:
# Add an offer count per order_gk (number of rows in data_offers)
df_count = data_offers.groupby(['order_gk']).count().reset_index()
df_count.rename(columns={'offer_id': 'offer_count'}, inplace=True)
# Inner join drops orders that never received an offer
merged = pd.concat(
    objs=(iDF.set_index('order_gk') for iDF in (data_orders, df_count)),
    axis=1, 
    join='inner'
).reset_index()
# Add a Category column from order_status_key and is_driver_assigned_key (4 categories)
conditions = [(merged['order_status_key'] == 4) & (merged['is_driver_assigned_key'] == 1),
              (merged['order_status_key'] == 4) & (merged['is_driver_assigned_key'] == 0),
              (merged['order_status_key'] == 9) & (merged['is_driver_assigned_key'] == 1),
              (merged['order_status_key'] == 9) & (merged['is_driver_assigned_key'] == 0)]
choices = ['by client, after driver assignment', 'by client, before driver assignment', 'by system, after driver assignment', 'by system, before driver assignment']
merged['Category'] = np.select(conditions, choices, default='other')
# Bar chart: cancellations per category

groupby_category = merged.groupby(['Category'])['order_gk'].count().reset_index()
groupby_category.rename(columns={'order_gk': 'cancellations'}, inplace=True)
x_labels=[]
for label in groupby_category['Category']:
    x_labels.append(label.replace(', ','<br>'))
fig_cancel_category = px.bar(groupby_category, x='Category', y='cancellations',
             color_discrete_sequence=['lightseagreen'], text_auto=True)
fig_cancel_category.update_layout(title_text="Cancellations per Category", title_x=0.5, xaxis={'categoryorder':'total descending'})
fig_cancel_category.update_xaxes(tickvals=groupby_category['Category'], ticktext=x_labels)
fig_cancel_category.update_xaxes(type='category')
fig_cancel_category.show()

Plot the distribution of failed orders by hour. Do certain hours show an abnormally high proportion of one category or another? Which hours have the most failures? How can this be explained?

In [ ]:
# Group by hour(order_datetime); count cancellations per Category
merged['order_datetime'] = pd.to_datetime(merged['order_datetime'], format='%H:%M:%S')
merged['hour'] = merged['order_datetime'].dt.hour
groupby_hour = merged.groupby(['hour','Category'])['order_gk'].count().reset_index()
groupby_hour.rename(columns={'order_gk': 'cancellations'}, inplace=True)
# Pivot: hour as index, one column per category, values = cancellation counts
stacked_df = groupby_hour.pivot_table(values='cancellations',index='hour',columns='Category').fillna(0)

# Matplotlib alternative: stacked bar chart
# stacked_df.plot.bar(stacked=True, rot=0) 
fig_cancel_hour = px.bar(groupby_hour, x='hour', y='cancellations', color='Category',
             color_discrete_sequence=['lightsalmon', 'lightseagreen',
                'lightskyblue', 'lightslategray'])
fig_cancel_hour.update_layout(title_text="Cancellations per hour", title_x=0.3)
fig_cancel_hour.update_xaxes(type='category')
fig_cancel_hour.show()

Plot the average time to cancellation, with and without a driver, by hour. If there are outliers in the data, remove them first. Can we draw any conclusions from this plot?
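An alternative to the pivot-based route used below is a direct long-format aggregation: group by hour and take the mean. A sketch on toy data (hours and cancellation times are made up):

```python
import pandas as pd

# Toy long-format frame: one row per cancelled order
df = pd.DataFrame({
    "hour": [0, 0, 1, 1, 1],
    "cancellations_time_in_seconds": [100.0, 200.0, 60.0, 90.0, 120.0],
})

# Mean cancellation time per hour, ready for a bar chart
mean_by_hour = df.groupby("hour")["cancellations_time_in_seconds"].mean()
print(mean_by_hour.to_dict())  # {0: 150.0, 1: 90.0}
```

The notebook instead pivots each hour into its own column so that the IQR-based outlier filter can run column by column, which is why the pivot step appears first.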

In [ ]:
# Find outliers in cancellations_time_in_seconds
# Split into two data frames: with / without an assigned driver,
# keeping only hour and cancellations_time_in_seconds
df_with_driver = merged[merged['is_driver_assigned_key'] == 1]
df_with_driver_cancel = df_with_driver[['hour', 'cancellations_time_in_seconds']]
df_without_driver = merged[merged['is_driver_assigned_key'] == 0]
df_without_driver_cancel = df_without_driver[['hour', 'cancellations_time_in_seconds']]

# Reshape for the plot: one column per hour
df_with_driver_pivot = df_with_driver_cancel.pivot_table(values='cancellations_time_in_seconds', index=df_with_driver_cancel.index, columns='hour', aggfunc='first')
df_without_driver_pivot = df_without_driver_cancel.pivot_table(values='cancellations_time_in_seconds', index=df_without_driver_cancel.index, columns='hour', aggfunc='first')
df_without_driver_pivot.describe()
df_without_driver_pivot.describe()
Out[ ]:
hour 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
count 218.000000 155.000000 164.000000 178.000000 33.000000 13.000000 48.000000 131.000000 336.000000 119.000000 ... 81.000000 148.000000 94.000000 162.000000 108.000000 67.000000 163.000000 365.000000 231.000000 267.000000
mean 95.756881 91.283871 112.335366 110.050562 81.636364 113.230769 139.270833 117.977099 111.446429 117.025210 ... 85.901235 106.297297 96.053191 88.512346 68.074074 72.552239 98.239264 116.526027 88.359307 97.707865
std 89.814969 69.392705 88.731917 83.632994 77.545075 74.605802 189.925040 89.517292 93.357075 107.378742 ... 102.675777 118.710902 140.975508 90.154542 61.304923 91.126001 78.511762 89.357828 78.830340 89.700194
min 5.000000 5.000000 4.000000 6.000000 5.000000 9.000000 5.000000 5.000000 3.000000 6.000000 ... 4.000000 4.000000 7.000000 4.000000 7.000000 6.000000 4.000000 5.000000 5.000000 4.000000
25% 28.250000 36.000000 49.750000 40.250000 12.000000 68.000000 34.000000 50.000000 47.750000 43.500000 ... 14.000000 31.750000 23.500000 31.500000 20.000000 13.000000 34.000000 51.000000 28.000000 30.000000
50% 72.500000 77.000000 95.500000 89.500000 53.000000 109.000000 102.000000 95.000000 91.000000 89.000000 ... 65.000000 82.000000 59.000000 70.500000 44.000000 50.000000 82.000000 106.000000 69.000000 77.000000
75% 141.500000 122.000000 168.000000 165.000000 148.000000 177.000000 180.250000 183.500000 166.750000 169.500000 ... 120.000000 148.500000 117.750000 124.000000 96.750000 91.500000 153.500000 162.000000 122.500000 151.000000
max 833.000000 343.000000 640.000000 592.000000 234.000000 250.000000 1179.000000 467.000000 725.000000 735.000000 ... 782.000000 930.000000 1035.000000 862.000000 243.000000 557.000000 368.000000 831.000000 625.000000 723.000000

8 rows × 24 columns

In [ ]:
# Remove outliers column-wise using the 1.5 * IQR rule
def removeOutliers(df):
    for column_name, column_data in df.items():
        Q1 = column_data.quantile(0.25)
        Q3 = column_data.quantile(0.75)
        IQR = Q3 - Q1
        threshold = 1.5

        # Drop rows falling outside the IQR fences for this column
        outliers = df[(df[column_name] < Q1 - threshold * IQR) | (df[column_name] > Q3 + threshold * IQR)]
        df = df.drop(outliers.index)

    return df
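The 1.5 * IQR fence the function applies can be checked on a small series with one obvious outlier (the numbers below are made up for illustration):

```python
import pandas as pd

# One extreme value among otherwise tight data
s = pd.Series([90, 95, 100, 105, 110, 1000])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Keep only values inside [q1 - 1.5*IQR, q3 + 1.5*IQR]
clean = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]
print(clean.tolist())  # the 1000 is dropped
```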
In [ ]:
# Remove outlier
df_with_driver_clean = removeOutliers(df_with_driver_pivot)
df_without_driver_clean = removeOutliers(df_without_driver_pivot)
In [ ]:
df_with_driver_clean.describe()
Out[ ]:
hour 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
count 84.000000 72.000000 70.000000 57.000000 30.000000 20.000000 42.000000 138.000000 235.000000 130.000000 ... 77.000000 114.000000 95.000000 110.000000 97.000000 88.000000 89.000000 81.000000 102.000000 110.000000
mean 202.488095 195.166667 290.871429 237.298246 149.033333 149.700000 135.833333 149.543478 113.446809 148.015385 ... 141.974026 139.438596 138.568421 147.281818 177.701031 140.147727 151.685393 174.148148 177.117647 207.963636
std 174.277683 194.603751 292.311226 254.335476 159.291423 160.023715 149.869686 135.180988 86.798423 143.566746 ... 134.062728 121.472458 144.318181 153.584817 186.717097 127.893323 152.862505 171.203980 178.329725 211.009019
min 14.000000 9.000000 16.000000 11.000000 10.000000 11.000000 14.000000 6.000000 13.000000 12.000000 ... 17.000000 9.000000 8.000000 15.000000 13.000000 17.000000 13.000000 7.000000 11.000000 10.000000
25% 48.750000 45.000000 52.000000 50.000000 33.250000 26.750000 38.000000 45.250000 45.000000 45.000000 ... 45.000000 49.000000 43.500000 42.750000 53.000000 46.000000 43.000000 53.000000 41.500000 47.250000
50% 157.000000 132.000000 180.000000 98.000000 77.500000 86.500000 72.000000 108.500000 90.000000 87.000000 ... 70.000000 100.000000 73.000000 79.000000 79.000000 84.000000 80.000000 99.000000 99.500000 142.500000
75% 295.000000 268.000000 478.250000 394.000000 212.250000 228.500000 158.000000 219.500000 158.500000 199.250000 ... 196.000000 184.750000 186.000000 214.750000 226.000000 194.000000 222.000000 281.000000 254.500000 336.000000
max 679.000000 753.000000 1146.000000 830.000000 630.000000 526.000000 576.000000 570.000000 390.000000 622.000000 ... 543.000000 518.000000 555.000000 624.000000 761.000000 487.000000 581.000000 689.000000 646.000000 833.000000

8 rows × 24 columns

In [ ]:
# Average time for cancellations 
df_with_driver_mean = df_with_driver_clean.mean().to_frame().reset_index()
df_with_driver_mean.rename(columns={0: 'mean_cancellations_time'}, inplace=True)
df_without_driver_mean = df_without_driver_clean.mean().to_frame().reset_index()
df_without_driver_mean.rename(columns={0: 'mean_cancellations_time'}, inplace=True)
In [ ]:
# Grouped bar chart: mean cancellation time, with vs without driver
fig_mean_cancel_time = go.Figure()
fig_mean_cancel_time.add_bar(x=df_with_driver_mean['hour'], y=df_with_driver_mean['mean_cancellations_time'], 
                             name="with driver", marker_color='lightseagreen')
fig_mean_cancel_time.add_bar(x=df_without_driver_mean['hour'], y=df_without_driver_mean['mean_cancellations_time'], 
                             name="without driver", marker_color='lightslategray')
fig_mean_cancel_time.update_layout(title_text="Mean cancellations time per hour", title_x=0.5)
fig_mean_cancel_time.update_xaxes(type='category')
fig_mean_cancel_time.show()

Plot the distribution of average ETA by hour. How can this plot be explained?

In [ ]:
# Group by hour(order_datetime); value = mean of m_order_eta
# Reshape for the plot: one column per hour

df_with_driver_eta = df_with_driver[['hour', 'm_order_eta']]
df_with_driver_eta_pivot = df_with_driver_eta.pivot_table(values='m_order_eta', index=df_with_driver.index, columns='hour', aggfunc='first')
In [ ]:
# Remove Outliers
df_with_driver_eta_clean = removeOutliers(df_with_driver_eta_pivot)
In [ ]:
# Average time for ETA 
df_with_driver_eta_mean = df_with_driver_eta_clean.mean().to_frame().reset_index()
df_with_driver_eta_mean.rename(columns={0: 'mean_eta'}, inplace=True)
In [ ]:
# Bar chart: mean ETA per hour (driver assigned only)
fig_mean_eta = px.bar(df_with_driver_eta_mean, x='hour', y='mean_eta',
             color_discrete_sequence=['lightseagreen'])
fig_mean_eta.update_layout(title_text="Mean eta", title_x=0.5)
fig_mean_eta.update_xaxes(type='category')
fig_mean_eta.show()
In [ ]:
# Scatter of order origins, coloured by ETA
df_scatter = data_orders
fig = px.scatter(df_scatter, x="origin_longitude", y="origin_latitude", color='m_order_eta')
fig.show()
In [ ]:
# Scatter of order origins, coloured by failure category
df_scatter2 = merged
fig = px.scatter(df_scatter2, x="origin_longitude", y="origin_latitude", color='Category')
fig.show()